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Abstract 

Background 

The cellular function of a vast majority of proteins is performed through physical interactions with other 
biomolecules, which, most of the time, are other proteins. Peptides represent templates of choice for mim- 
icking a secondary structure in order to modulate protein-protein interaction. They are thus an interesting class 
of therapeutics since they also display strong activity, high selectivity, low toxicity and few drug-drug interactions. 
Furthermore, predicting which peptides would bind to which MHC alleles would be of tremendous benefit to 
improve vaccine based therapy and possibly generating antibodies with greater affinity. Modern computational 
methods have the potential to accelerate and lower the cost of drug and vaccine discovery by selecting potential 
compounds for testing in silico prior to biological validation. 

Results 

We propose a specialized string kernel for small bio-molecules, peptides and pseudo-sequences of binding interfaces. 
The kernel incorporates physico-chemical properties of amino acids and elegantly generalize eight kernels, such 
as the Oligo, the Weighted Degree, the Blended Spectrum, and the Radial Basis Function. We provide a low 
complexity dynamic programming algorithm for the exact computation of the kernel and a linear time algorithm 
for it's approximation. Combined with kernel ridge regression and SupCK, a novel binding pocket kernel, the 
proposed kernel yields biologically relevant and good prediction accuracy on the PepX database. For the first 
time, a machine learning predictor is capable of accurately predicting the binding affinity of any peptide to any 
protein. The method was also applied to both single-target and pan-specific Major Histocompatibility Complex 
class II benchmark datasets and three Quantitative Structure Affinity Model benchmark datasets. 

Conclusion 

On all benchmarks, our method significantly (p-value < 0.057) outperforms the current state-of-the-art methods 
at predicting peptide-protein binding affinities. The proposed approach is flexible and can be applied to predict any 
quantitative biological activity. Moreover, generating reliable peptide-protein binding affinities will also improve 
system biology modelling of interaction pathways. Lastly, the method should be of value to a large segment of 
the research community with the potential to accelerate peptide-based drug and vaccine development. 
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1 Background 

The cellular function of a vast majority of proteins is performed through physical interactions with other 
biomolecules, which, most of the time, are other proteins. Indeed, essentially all of the known cellular 
and biological processes depend, at some level, on protein- protein interactions (PPI) [1,2]. Therefore, the 
controlled interference of PPI with chemical compounds provides tremendous potential for the discovery of 
novel molecular tools to improve our understanding of biochemical pathways as well as the development of 
new therapeutic agents [3,4]. 

Considering the nature of the interaction surface, protein secondary structures are essential for binding 
specifically to protein interaction domains. Peptides also represent templates of choice for mimicking a 
secondary structure in order to modulate protein-protein interactions [5,6]. Furthermore, they are a very 
interesting class of therapeutics since they display strong activity, high selectivity, low toxicity and fewer 
drug-drug interactions. They can also serve as investigative tools to gain insight into the role of a protein, 
by binding to distinct regulatory regions to inhibit specific functions. 

Yearly, large sums of monies are invested in the process of finding draggable targets and identifying 
compounds with medicinal utility. The widespread use of combinatorial chemistry and high-throughput 
screening in the pharmaceutical and biotechnology industries implies that millions of compounds can be 
tested for biological activity. However, screening large chemical libraries generates significant rates of both 
false positives and negatives. The process is expensive and faces a number of challenges in testing candidate 
drugs and validating the hits, all of which must be done efficiently to reduce costs and time. Computational 
methods with reasonable predictive power can now be envisaged to accelerate the process providing an 
increase in productivity at a reduced cost. 

Furthermore, peptides ranging from 8 to 12 AA represent the recognition unit for the immune system 
and in particular the MHC (Major Hiscompatibility Complex). Being capable of predicting which peptides 
would bind to which MHC alleles would be of tremendous benefit to improve vaccine based therapy, possibly 
generating antibodies with greater affinity that could yield an improved immune response. Moreover, simply 
having data on the binding affinity of peptides and proteins could readily assist system biology modelling of 
interaction pathways. 

The ultimate goal is to build a predictor of the highest binding affinity peptides. If one had a fast 
and accurate peptide-protein binding affinity predictor, one could easily predict the affinity of huge sets of 
peptides and select the best candidates, or use stochastic optimization methods if the set of peptides were 
too large. The task would be facilitated if one could build such a fast and accurate predictor. This paper 
provides a step in this direction with the use of a machine learning algorithm based on kernel methods and 
a novel kernel. 

Traditional machine learning approaches used to focus on binary classification of compounds (binding, 
non-binding) [7]. Non-binding compounds are rarely known and available quantitative binding affinity in- 
formation is lost during training, a major obstacle to binary classification. New databases, such as the PcpX 
database, contain binding affinities between peptides and a large group of protein families. The first part of 
this paper presents a general method for learning a binding affinity predictor between any peptide and any 
protein, a novel machine learning contribution to biology. 

The Immune Epitope Database (IEDB) [8] contains a large number of binding affinities between peptides 
and Major Histocompatibility Complex (MHC) alleles. Predicting methods for MHC class I alleles have 
already had great success [9] . The simpler binding interface of MHC-I molecules makes the learning problem 
significantly easier than for MHC-II molecules. Allele specific (single-target) methods for MHC class II 
alleles have also reasonable accuracy, despite requiring a large number of training data for every allele in 
order to achieve reasonable accuracy [9]. Pan-specific (multi-target) methods, such as MultiRTA [10] and 
NetMHCIIpan-2.0 [11], were designed in order to overcome this issue. These methods can predict, with 
reasonable accuracy, the binding affinity of a peptide to any MHC allele, even if this allele has no known 
peptide binders. 

We propose a new machine learning approach based on kernel methods [12] capable of both single-target 
and multi-target (pan-specific) prediction. We searched for kernels that encode relevant binding information 
for both proteins and peptides. Therefore, we propose a new kernel, the GS kernel, that generalizes most 
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of kernels currently used in this setting (RBF [12], Blended spectrum [12], Oligo [13], Weighted Degree [14], 
...). The GS kernel is shown to be a suitable similarity measure between peptides and pseudo-sequences of 
MHC-II binding interfaces. 

For the machine learning algorithm itself, we show that kernel ridge regression [12] (KRR) is generally 
preferable to the support vector regression (SVR) algorithm [15] because KRR has less hyperparameters to 
tune than SVR, thus making the learning time smaller. The regression score that we obtain on PepX is 
competitive with the ones that we obtain on data sets containing peptides binding to a single protein, even 
if the former task is, in theory, much more difficult. For the peptide-MHC binding problem, comparison on 
benchmark datasets show that our algorithm surpasses NetMHCIIpan-2.0 [11], the current state-of-the-art 
method. Indeed, in the most difficult pan-specific case (when the algorithm is trained on all alleles except 
the allele used for testing), our algorithm performs better than the state of the art in most cases. Finally, our 
method outperform SVR on three quantitative structure affinity model (QSAM) single-target predictions 
benchmarks [16]. We thus propose a machine learning approach to immunology and a novel string kernel 
which have shown to yield impressive results on benchmark datasets for various biological problems. 



2 Methods 

2.1 Statistical machine learning and kernel ridge regression in our context 

Given a set of training examples (or cases), the task of a learning algorithm is to build a accurate predictor. 
In this paper, each example will be of the form ((x, y),e), where x represents a peptide, y represents a 
protein, and e is a real number representing the binding energy (or the binding affinity) between the peptide 
x and the protein y. A multi-target predictor is a function h that returns an output /i(x, y) when given any 
input (x, y). In our setting, the output h(x, y) is a real number estimate of the "true" binding energy (or 
the binding affinity) e between x and y. The predictor h is accurate on example ((x,y),e) if the predicted 
output /i(x, y) is very similar to the real output e. A predictor is good when it is accurate on most future 
examples unseen during training. 

With kernel methods, each input (x, y) is mapped to a feature vector </>(x, y) = 
(0i (x, y), </>2(x, y), . . . , 0d(x, y)) of large dimensionality d. Moreover, the predictor is represented by 
a real- valued weight vector w that lies in the space of feature vectors. Given an arbitrary input (x, y), the 
output of the predictor /i w is given by the scalar product 

d 

/i w (x,y) = w<£(x,y) = ^w 4 ^(x,y). 

i=l 

The lost incurred by predicting a binding energy /i w (x,y) on input (x, y), when the true binding energy 
is e, is measured by a loss function £(w, (x,y),e). As is usual in regression, we will use the quadratic loss 
function 

^(w, (x, y), e) = (e - w • <£(x, y)) 2 . 

The fundamental assumption in machine learning is that each example ((x,y),e) is drawn according to 
some unknown distribution D. Then the task of the learning algorithm is to find the predictor /i w having 
the smallest possible risk i?(/i w ) defined as the expected loss 

fl(fcw) = „ E *(w,(x,y),e). 

((x,y),e)~D 

However, the learning algorithm does not have access to D. Instead, it has access to to a training 
set S = {((xi,yi),ei),((x2,y 2 ),e 2 ),...,((x m ,y m ),e m )} of m examples where each example ((x i ,y i ),e i ) 
is assumed to be generated independently according to the same (but unknown) distribution D. Modern 
statistical learning theory [12,17] tells us that the predictor h w minimizing the ridge regression cost function 
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F(S, w) will have a small risk i?(/i w ) whenever the obtained value of F(S,w) is small. Here, F(S,w) is 
defined as 



F(S,w) d = f ||w|| 2 + C^(w,(x 4 ,y t ), ei )= || w || 2 + C^(e i -w^(x i ,y i )) 2 



1=1 



for some suitably-chosen constant C > 0. The first term of F(S,w), ||w|| 2 = f w • w, which is the squared 
Euclidean norm of w, is called a regularizer and it penalizes predictors having a large norm (complex 
predictors). The second term measures the accuracy of the predictor on the training data. Consequently, 
the parameter C controls the complexity-accuracy trade-off. Its value is usually determined by measuring 
the accuracy of the predictor on a separate ( "hold-out" ) part of the data that was not used for training, or 
by more elaborate sampling methods such as cross-validation. 

The representer theorem [12,17] tells us that the predictor w* that minimizes F(S,w) lies in the linear 
subspace span by the training examples. In other words, we can write 

m 

»=i 

where the coefficients Qj are called the dual variables and provide collectively the dual representation of the 
predictor. This change of representation makes the cost function dependent on ^(x,) only via the scalar 
product 0(xj,yj) • <j>(x.j,yj) ~^ fc((xj, y^), (xj, y^)) for each pair of examples. The function k is called a 
kernel and has the property of being efficiently computable for many feature maps <j>, even if the feature 
space induced by <j> has an extremely large dimensionality. By using k instead of <j>, we can construct linear 
predictors in feature spaces of extremely large dimensionality with a running time that scales only with the 
size of the training data (with no dependence on the dimensionality of <f>) . This fundamental property is also 
known as the kernel trick [12,17]. It is important to point out that, since a kernel corresponds to a scalar 
product in a feature space, it can be considered as a similarity measure. A large (positve) value of the kernel 
normally implies that the corresponding feature vectors point in the same direction (similar), although a 
value close to zero normally implies that the two feature vectors are mostly orthogonal (dissimilar). 

We restrict ourselves to joint feature maps having the form <f>(x, y ) = <j> x (x) ® <j>y (y) where (g> denotes the 
tensor product. The tensor product between two vectors a = (a\, . . . , a n ) and b = (b\, . . . , b m ) denotes the 
vector a® b = (a\bi, a\b 2l ■ ■ ■ , a n b m ) of all the nm products between the components of a and b. The tensor 
product allows us to express the joint feature space (j) in terms of the feature space 4>x °f the peptides and 
the feature space (j>y of the proteins. If we now define the peptide kernel kx by fc^(x, x') = ^(x) • ^(x'), 
and the protein kernel ky by ky{y 7 y') = f <j>y{y) -(fryiy'), the joint kernel k simply decomposes as the product 
of kx and ky as 

fc((x,y),(x',y')) = ^x,y).^(x',y') 

= <M x )®<My) ®<My') 

= (^(x).<Mx'))(^(y)-«My / )) 
= k x (x,x')ky(y,y') . 

Consequently, from the representer theorem we can write the multi-target predictor as 

m 

/i w .(x,y) = w* -<A(x,y) = w* • (^(x) ® (f> y (y)) = ^ otikxfc, x)k y (yi, y) . 

i=i 

In the case of the quadratic loss £(w, (x, y), e) = (e — w • ^(x, y)) 2 , F(S, w) is a strongly convex function 
of w for any strictly positive C. In that case, there exists a single local minimum which coincides with the 
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global minimum. This single minimum is given by the point w* where the gradient dF(S,w)/dw vanishes. 
For the quadratic loss, this solution w* is given by 

a=(K+±l) e, (1) 

where a = (ai,...,a m ), e = f (ei,...,e m ), K denotes the Gram matrix of kernel values Kij = 
kx(xi,x.j)ky(yi,yj), and I denotes de to x to identity matrix. Hence, the learning algorithm for kernel 
ridge regression just consists at solving Equation (1). Note that for a symmetric positive semi-definite kernel 
matrix K, the inverse of K + I/C always exists for any finite value of C > 0. Note also that the inverse of 
an m x to matrix is obtained in 0(m 3 ) time with the Gaussian-elimination method and in 0(to 2 376 ) time 
with the Coppersmith- Winograd algorithm. 

Finally, we will also consider the single protein target case where only one protein y is considered. In 
this case, the predictor h w predicts the binding energy from a feature vector 4> x constructed only from the 
peptide. Hence, the predicted binding energy for peptide x is now given w • 4> x (x.). So, in this single protein 
target case, the cost function to minimize is still given by F(S, w) but with </>(x, y) = ^(x). Consequently, 
in this case, the solution is still given by Equation (1) but with a kernel matrix K given by Kij — fc^(xj,Xj). 
The single-target predictor is thus given by 

m 

h w .(x) = w* • <j> x (x) = ^a l fc A -(x l ,x) . 

»=i 

Kernel methods have been extremely successful within the last decade, but the choice of the kernel is 
critical for obtaining good predictors. Hence, confronted with a new application, we must be prepared to 
design an appropriate kernel. The next subsections show how we have designed and chosen both peptide 
and protein kernels. 



2.2 A generic string (GS) kernel for small bio-molecule strings 

String kernels for bio-molecules have been applied with success in bioinformatics and computational biology. 
Kernels for large bio-molecules, such as the local-alignment kernel [18] have been well studied and applied 
with success to problems such as protein homology detection. However, we observed that these kernels 
perform rather poorly on smaller compounds (data not shown). Consequently, kernels designed for smaller 
bio-molecules like peptides and pseudo sequences have recently been proposed. Some of these kernels [13] 
exploit sub-string position uncertainty while others [19] use physicochemical properties of amino acids. We 
present a kernel for peptides that exploits both of these properties in a unified manner. 

The proposed kernel, which we call the generic string (GS) kernel, is a similarity measure defined for any 
pair (x, x') of strings of amino acids. Let £ be the set of all amino acids. Then, given any string x of amino 
acids (e.g., a peptide), let |x| denote the length of string x, as measured by the number of amino acids in x. 
The positions of amino acids in x are numbered from 1 to |x|. In other words, x = xi,x 2 , ■ ■ ■ , £| x | with all 
Xi € S. 

Now, let ij) : £ — > R d be an encoding function such that for each amino acid a, 

lp(a) = (ip 1 (a),i> 2 (a),...Ma)) (2) 

is a vector where each component Vi ( a ) encodes one of the d properties (possibly physicochemical) of amino 
acid a. In a similar way, we define tp l : S' — > R dl as an encoding function for strings of length I. Thus, 
(a) encodes all I amino acids of a concatenning I vectors, each of d components: 

il>\ai,a 2 ,..,ai) = (^(a 1 ),^)(a 2 ), • . . ,tp{ai)) (3) 
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Fixed parameters 


Free parameters 


Kernel name 


L = 1, a p — > 0, <r c — >• 




Hamming distance 


L — » oo, (T p — > 0, <j c — » 




Dirac delta 


(T p — > OO , (T c — > 


L 


Blended Spectrum [12] 


(Tp — > OO 


I/, CT C 


Blended Spectrum RBF [19] 


(Tc 


L, (Tp 


Oligo [13] 


L — > oo, (Tp — > 




Radial Basis Function (RBF) 


(Tp — > 0, a c — > 


L 


Weighted degree [14] 


(Tp -» 


L,a c 


Weighted degree RBF [19] 




L,(Tp,a c 


Generic String (GS) 



Table 1: Special cases of the GS kernel: eight know kernels can be obtained by fixing different parameters 
of the GS kernel. 



Let L > 1 be a maximum length for substring comparison. We define the generic string (GS) kernel as 
the following similarity function over any pair (x, x') of strings: 

gs(x,x',£,^ c ) = EE E e[ 2 ° l > ^ 2CTc >■ ( 4 ) 

1=1 i=0 j=0 

In other words, this kernel compares each substring Xi + i,x i+2 , --,x i+ i of x of size / < L with each substring 
x'j +1 ,Xj +2 ...,Xj +l of x' having the same length. Each substring comparison yields a score that depends on 
the -0-similarity of their respective amino acids and a shifting contribution term that decays exponentially 
rapidly with the distance between the starting positions of the two substrings. The a p parameter controls 
the shifting contribution term. The a c parameter controls the amount of penalty incurred when the encoding 
vectors ip l (x i+ i, ..,x i+ i) and ij} {x'- +k , ..,x'- +l ) differ as measured by the squared Euclidean distance between 
these two vectors. The GS kernel outputs the sum of all the substring-comparison scores. 

Also, note that the GS kernel can be used on strings of different lengths, which is a great advantage over 
a localized string kernel (of fixed length) such as the RBF and the weighted degree kernels [14, 19]. In fact, 
the GS kernel generalizes eight known kernels. Table 1 lists them with the fixed and free parameters. For 
example, when a v approaches +00 and a c approaches 0, the GS kernel becomes identical to the blended 
spectrum kernel [12], which has a free parameter L representing the maximum length of substrings. The 
free parameter values are usually determined by measuring the accuracy of the predictor on a separate 
("hold-out") part of the data that was not used for training, or by more elaborate sampling methods such 
as cross-validation. 

In the next subsection, we prove that the GS kernel is symmetric positive semi-definite and, therefore, 
defines a scalar product in some large-dimensional feature space (see [12]). In other words, for any hyper- 
parameter values (L,a v ,a c ), there exists a function $ Xli transforming each finite sequence of amino 
acids into a vector such that 

G5(x, x', L, (Tp, a c ) - <t> X(L ^ ac) (x) • <l>x (L ^ c) (*') • 

Consequently, the solution minimizing the ridge regression functional F(S,w) will be given by Equation (1) 
and is guaranteed to exist whenever the GS Kernel is used. 



2.2.1 Symmetric positive semi-definiteness of the GS kernel 

The fact that the GS kernel if positive semi-definite follows from the following theorem. The proof is provided 
as supplementary material [see Appendix]. 
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Theorem 1 Let E be an alphabet (say the alphabet of all the amino acids). For each I G {1,..,L}, let 
K[ : E' x Ti l — > M be a symmetric positive semi- definite kernel. Let A : R — > R be any function that is the 
convolution of another function B : R — > R, i.e., for all z, z' G R, we have 

/+oo 
B(z — t)B'(z' — t) At. 
-oo 

Then, the kernel K defined, for any two strings on the alphabet E, as 

L |x|-i |x'|-« 

if(x,x') = ^ X) ^M i -j) K l(( x i+U--> x i+l)A x 'j+l>--, x 'j+l)) 
1=1 i=0 j=0 

is also symmetric positive semi-definite. 

The positive semi-definiteness of the GS kernel comes from the fact that the GS kernel is a particular 
case of the more general kernel K defined in the above theorem. Indeed, we first note that both kernels are 
identical except A(i — j) in kernel K is specialized to exp( ~^~^ ) in the GS kernel, and Ki(y , y') in kernel 

V 

— \\il> l (y) —tl> l (y)\\ 2 

K is specialized to exp(— - — ^ 2 — ) in the GS kernel. Note that this last exponential is just an RBF 
kernel (see [12] for a definition) that is defined over vectors of R ld of the form il> l (y); it is therefore positive 
semi-definite for any I G {1,2, ..,£}. For the first exponential, note that cxp( ~2~2 ^ ) is a function that is 
obtained from a convolution from another function since we can verify that 

Indeed, this equality is a simple specialization of Equation (4.13) of [20]. It is related to the fact that the 
convolution of two Normal distributions is still a Normal distribution. 

Finally, it is interesting to point out that Theorem 1 can be generalized to any function A on measurable 
sets M (not only the ones that are defined on R), provided that A is still is a convolution of another function 
B : M — > M. We omit this generalized version in this paper since Theorem 1 suffices to prove that the GS 
kernel is positive semi-definite. 

2.2.2 Efficient computation of the GS kernel 

To cope with today's data deluge, the presented kernel should have a low computational cost. For this task, 
we first note that, before computing G5(x, x', L, cr p , a c ) for each pair (x, x') in the training set, we can first 
compute 

d 

E(a,a') d ^ |^(a)-^(a')|| 2 = ^(^ P (a) - ^ p (a')) 2 , 

P =i 

for each pair (a, a') of amino acids. After this pre-computation stage, done in 0(d ■ |E| 2 ) time, each access 
to E(a, a') is done in O(l) time. We will not consider the running time of this pre-computation stage in 
the complexity analysis of the GS kernel, because it only has to be done once to be valid for any 5-tuplc 
(x,x',L,ct p ,(7 c ). Let us recall that, in kernel methods, the kernel has to be evaluated a huge number of 
times, therefore any reasonable pre-computation can be omitted in the complexity analysis of such kernel. 

Now, recall that we have defined ip l : E' — > R as the concatenation of vectors of the form ip(a) (see 
Equation (2)). Hence ||^>'(a) — i/> (a')|| is an Euclidian norm, and we have that 

||^(a)-VV)|| 2 = Ell^)-^)ll 2 = I>KX) (5) 

k=l fe=l 
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Following this, we can now write the GS kernel as follows 

L |x|-J|x'|-« ( d=Zlf 

:E E 

1=1 i=0 j=0 



-Sfc=i E(x i+k ,x' j+k ) 

GS(xy,L,a p ,a c ) = E E E eV ^ ^ ^ 7 ( 6 ) 



|x| |x I / L! £1 \ min(L,|x|-»,|x —5 J -±— 



= EI> V 2 °* ; £ ^ 2 " ? '■ m 

i=0 3=0 1=1 

where mm(L, |x| — i, |x'| — j) is used in order to assure that i + k and j + k are valid positions in strings x 
and x'. 

Now, for any L, |x|, |x'|, and any i £ {!,..., |x|},j e {1, . . . , |x'|}, let 



min(L,|x|-i,|x'|-j) 



■ELl E(x i+k ,x' j+k ) 



Bi,= E eV J (8) 



We therefore have 



GS(x, x', L, a p , tic) = E E CX P f ( o 2 J)2 ) ' B ^ ■ ( 9 ) 

Since min(i, |x| — i, |x'| — j) < L, we see, from Equation (8), that the computation of each entry B it j seems to 
involve 0(L 2 ) operations. However, we can reduce this complexity term to O(L) by a dynamic programming 
approach. Indeed, consider the following recurrence: 

( 1 if k = 

tfc = < Azf<wlW^ (10) 

[tfc_i-eV 2 " r / otherwise. 

We thus have 

min(i,|x|-i,|x'|-j) 

B itj = E tk ( n ) 

fe=l 

The computation of each entry Bjj therefore involves only 0(L) operations. Consequently, the running time 
complexity of the GS kernel is O (|x| • |x'| • L). 



2.2.3 GS Kernel relaxation 

We now show how to compute a very close approximation of the GS kernel in linear time. Such a feature is 
interesting if one wishes to do a pre or post treatment where the symmetric positive semi-dchnitc (SPSD) 
property of the kernel is not required. For example, as opposed to the training stage where the inverse of 
K + I/C is guaranteed to exists only for a SPSD matrix K, kernel values in the prediction stage could be 
approximate. Indeed, the scalar product with a is defined for non positive semi-definite kernel values. This 
scheme would greatly speed up the predictions with a very small precision lost. 

The shifting penalizing term cxp ^ ~^~P ^ in Equation (4) implies that the further two substrings are 
from each other, no matter how similar they are, their contribution to the kernel will vanish exponentially 
rapidly. Let 6 be the maximum distance between two substrings that we intend to consider in the computation 
of the approximate version of the GS kernel. In other words, any substring whose distance is greater than 5 
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will not contribute. We propose to fix S = [~3<7 P ] . In this case, the contribution of any substring beyond S is 
bound to be minimal. For the purpose of demonstration, let P be the |x| x |x'| matrix 

if \i-j\ > S 
otherwise . 

P is thus a sparse matrix with exactly <5|x| + <5|x'| — S 2 non-zero values around it's diagonal. We can therefore 
write this approximation of the GS kernel as 

1*1 Ix'l 

GS'(x, x', L, <t p , a c , 5)=J2J2 P « ' B ^ ■ ( 13 ) 

j=o j=a 

It is now clear that only values of B for which the value in P is non-zero need to be computed. The complexity 
of GS' is dominated by the computation of matrix B whose <5|x| + <5|x'| — S 2 entries can be computed in 
O (max(|x|, |x'|)). Since L and S are constant factors, we have that GS' G O (max(|x|, |x'Q) with an optimal 
linear complexity. 



2.3 Kernel for protein binding pocket 

Hoffmann ct al. [21] proposed a new similarity measure between protein binding pockets. The similar- 
ity measure aligns atoms extracted from the binding pocket in 3D and assigns a score to the alignment. 
Pocket alignment is possible for proteins that share low sequence and structure similarity. They proposed 
two variations of the similarity measure. The first variation only compares the shape of pockets to assign 
a score. In the second variation, atom properties, such as partial charges, re-weight the contribution of 
each atom to the score. We will refer to these two variations respectively as sup-CK and sup-CK^. Since 
both scores are invariant by rotation and translation, they are not positive semi-definite kernels. To ob- 
tain a valid kernel, we have used the so-called empirical kernel map where each y is mapped explicitly to 
(fe(yi) y)j k(y2, y), • • • , k(y m ,y)). To ensure reproducibility and avoid implementation errors, all experiments 
were done using the implementation provided by the authors. An illustration of the pocket creation for the 
SupCk kernel is shown in Figure 1. 



2.4 Metrics and experimental design 

When dealing with regression values, classical metrics used for classification such as the area under the ROC 
curve (AUC) [22] are not suitable. To compute the AUC, some authors fix a binding threshold value used to 
transform the regression problem into a binary classification problem. The real value output of the predictor 
is also mapped to a binary class using the same threshold value. The AUC is then computed on the binary 
output. Unfortunately, this method makes the AUC metric dependent on the threshold value chosen to 
transform the real output values in the regression problem. For this reason, we decided not to present AUC 
results in the main paper. Those results are nevertheless provided as supplementary material [see Appendix]. 

Metrics such as the root mean squared error (RMSE), the coefficient of determination (R 2 ) and the 
Pearson product-moment correlation coefficient (PCC) are more suited for regression. In this paper, we 
have used the PCC and the RMSE to evaluate the performance of our method. 

Except when otherwise stated, nested cross-validation was done for estimating the PCC and the RMSE 
of the predicted binding affinities. For all n outer folds, n — 1 inner cross-validation folds were used for 
the selection of the kernel hyper-parameters and the C parameter of Equation (1). All reported values 
are computed on the union of the outer fold test set predictions. This is important since an average of 
correlation coefficients is not a valid correlation coefficient. This is also true for the root mean squared error. 
An illustration of the nested cross-validation procedure is shown in Figure 2. 
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Figure 1: A pyMOL illustration of the binding pocket kernel shows a MHC-I molecule B*3501 complexed 
with a peptide (VPLRPMTY) from the NEF protein of HIV1 (PDB ID IAIN). The MHC protein is shown 
in yellow, the peptide in shown in red. 
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Figure 2: Illustration of the nested 10-fold cross-validation procedure. For all of the 10 outer folds, an inner 
9 fold cross-validation scheme was used to select hyper-parameters. 



More precisely, let e denote the average affinity in the data set V. Let T/. for k G {1, . . . , 10} denote the 
testing set of the fc th outer fold and let h v \ Tk (x i; y 4 ) be the predicted binding affinity on example ((x i; y^), e^) 
of the predictor built from V\Tk. Then the correlation coefficient was computed using 



PCC = 



ELi £ lS T fc ( e * ~ Vt, (xi, yj)f_ 



\ Ei G x) ( e i - e ) 2 



An algorithm that, on average, makes the same error as e will give PCC = and an algorithm that 
always returns a perfect predictor will give PCC = 1. 
As for the RMSE, it was computed using 



RAISE = ^fe-VTjx.y,))^ 

V \ T k\ 

Therefore, the perfect predictor will give RAISE — and the value of this metric will increase as the 
quality of the predictor decrease. 

All the p- values reported in this article were computed using the two-tailed Wilcoxon signed- ranked test. 
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2.5 Data 

2.5.1 PepX database 

The PcpX database [23] contains 1431 high-quality peptide-protein complexes along with their protein and 
peptide sequences, high quality crystal structures, and binding energies (expressed in kcal/mol) computed 
using the FoldX force field method. Full diversity of structural information on protein-peptide complexes is 
achieved with peptides bound to, among others, MHC, thrombins, a-ligand binding domains, SH3 domains 
and PDZ domains. This database recently drew attention in a review on the computational design of peptide 
ligands [24] where it was part of large structural studies to understand the specifics of peptide binding. A 
subset of 505 non-redundant complexes was selected based on the dissimilarity of their binding interfaces. 
The authors of the database done the selection in such a way that this smaller subset still represented the 
full diversity of structural information on peptide-protein complexes present in the entire Protein Data Bank 
(PDB), see [23] for a description of the method. We will refer to the smaller subset as the "PepX Unique" 
data set and to the whole data base as "PepX All" . 

The few complexes with positive binding energies were removed from the dataset. No other modifications 
were made on the original database. 

2.5.2 Major histocompatibility complex class II (MHC-II) 

Two different approaches were used for the prediction of MHC class II - peptide binding affinities: single- 
target and multi-target (pan-specific). 

Single-target prediction experiments were conducted using the data from the IEDB dataset proposed by 
the authors of the RTA method [25]. The latter consists of 16 separate datasets, each containing data on 
the peptides binding to an MHC class II allotype. For each allotype, the corresponding dataset contains 
the binding peptide sequences and their binding affinity in kcal/mol. These datasets have previously been 
separated into 5 cross-validation folds to minimize overlapping between peptide sequences in each fold. It is 
well known in the machine learning community that such practice should be avoided, as opposed to random 
fold selection, since the training and test sets are then not independently generated. These predefined folds 
were nevertheless used for the purpose of comparison with other learning methodologies that have used them. 

Pan-specific experiments were conducted on the IEDB dataset proposed by the authors of the NetMHCI- 
Ipan method [26]. The dataset contains 14 different HLA-DR allotypes, with 483 to 5648 binding peptides 
per allotype. For each complex, the dataset contains the HLA allele's identifier (e.g.: DRB1*0101), the 
peptide's sequence and the log50fc transformed IC50 (Inhibitory Concentration 50%), which is given by 
1 - logsoooo ^C50. 

As pan-specific learning requires comparing HLA alleles using a kernel, the allele identifiers contained 
in the dataset were not directly usable for this purpose. Hence, to obtain a useful similarity measure 
(or kernel) for pairs of HLA alleles, we have used the pseudo sequences composed of the amino acids at 
highly polymorphic positions in the alleles' sequences. These amino acids are potentially in contact with 
peptide binders [26], therefore contributing to the MHC molecule's binding specificity. The authors of the 
NctMHCIIpan method proposed using pseudo sequences composed of the amino acids at 21 positions that 
were observed to be polymorphic for HLA-DR, DP and DQ [26]. With respect to the IMGT nomenclature 
[27], these amino acids are located between positions 1 and 89 of the MHC's /3 chain. Pseudo sequences 
consisting of all 89 amino acids between these positions were also used to conduct the experiments. 

2.5.3 Quantitative structure affinity model (QSAM) benchmark 

Three well-studied benchmark datasets for designing quantitative structure affinity models were also used 
to compare our approach: 58 angiotensin-I converting enzyme (ACE) inhibitory dipeptides, 31 bradykinin- 
potentiating pentapeptides and 101 cationic antimicrobial pentadecapeptides. These data sets were recently 
the subject of extensive studies [16] where partial least squares (PLS), Artificial Neural Networks (ANN), 
Support Vector Regression (SVR), and Gaussian Processes (CP) were used to predict the biological activity 
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SVR 


KRR 


sup-CK 


sup-CK 


sup-CK/, 


BS 


BS 


GS 


BS 


GS 


PcpX Unique 


0.6822 


0.7072 


0.7300 


0.7110 


0.7264 


PcpX All 


0.8227 


0.8580 


0.8648 


0.8601 


0.8652 



Table 2: Correlation coefficient (PCC) for multiple target predictions on the PepX database. Best results 
are highlighted in bold. 



of the peptides. GP and SVR were found to have the best results on the testing set, but their experiment 
protocol was unconventional because the training and test sets were not randomly selected from the data set. 
Instead, their testing examples were selected from a cluster analysis performed on the whole data set — thus 
favoring learning algorithms that tend to cluster their predictions according to the same criteria used to split 
the data. Instead, we randomly selected the testing examples from the whole data set — thus favoring no 
algorithm a priori. Theses datasets were chosen to demonstrate the ability of our method to learn on both 
small and large datasets. 

3 Results and discussion 
3.1 PepX database 

To our knowledge, this is the first kernel method attempt at learning a predictor which takes the protein 
crystal and the peptide sequence as input to predict the binding energy of the complex. Many consider this 
task as a major challenge with important consequences across biology. String kernels such as the LA-kernel 
and the blended spectrum were used while conducting experiments on proteins. They did not yield good 
results (data not shown), mainly because they do not consider the protein's secondary structure information. 
To validate this hypothesis and improve our results, we tried using the MAMMOTH kernel, a similarity 
function proposed by [28], recently used for prediction of protein-protein interactions from structures [29]. 
The MAMMOTH uses structural information from crystals to aligns two proteins using their secondary 
structure and assigns a score to the alignment. The greater the similarity between two proteins' secondary 
structure, the greater the alignment score will be. The use of the MAMMOTH kernel did improve the results 
but was still missing an important aspect of protein-peptide interaction. The interaction takes place at a 
very specific location on the surface of the protein called the binding pocket. Two proteins may be very 
different, but if they share a common binding pocket, it is likely that they will bind similar ligands. This is 
the core idea that motivated [21] in the creation of the sup-CK binding pocket kernel. 

Choosing a kernel for the peptides was also a challenging task. Sophisticated kernels for local signals 
such as the RBF, weighted degree, and weighted degree RBF could not be used because peptide lengths were 
not equal. In fact, peptide lengths vary between 5 and 35 amino acids, which makes the task of learning a 
predictor and designing a kernel even more challenging. This was part of our motivation in designing the 
GS kernel. For all experiments, the BLOSUM 50 matrix was found to be the optimal amino acid descriptors 
during cross-validation. 

Table 2 presents the first machine learning results for the prediction of binding affinity given any peptide- 
protein pair. We first observe that KRR has better accuracy than SVR. We also note that using the GS 
kernel over the simpler BS kernel improve accuracy for both the sup-CK and the sup-CK^ kernels for 
binding pockets. It is surprising that the sup-CK^ kernel does not outperform the sup-CK kernel on both 
benchmarks, since the addition of the atom partial charges should provide more relevant information to the 
predictor. 

Figures 3(a) and 3(b) present an illustration of the prediction accuracy using sup-CK for the PepX 
Unique dataset and sup-CKi for the PepX All dataset. For illustration purposes, the absolute value of the 
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Figure 3: Predicted values as a function of the true values for the PepX Unique (left) and PepX All (right) 
datasets. A perfect predictor would have all it's predictions lying on the x = y red line. 



binding energy has been plotted. We observe that the predictor has the property of maintaining the ranking, 
such that the peptides with high binding affinity can be identified, an important feature for drug discovery. 
Peptides with the highest affinities are the ones that, ultimately, will serve as precursor drug or scaffold in 
a rational drug design program. 

Experiments showed that a Pearson correlation coefficient of « 1.0 is attainable on the training set when 
using the binding pocket kernel, the GS kernel and a large C value (empirically m 100). This is a strong 
indication that our method has the ability of building a perfect predictor, but the lack of data quality and 
quantity may be responsible for the reduced performance on the testing set. Hence better data will improve 
the quality of the predictor. Initially, biological validation will be necessary but ultimately, when sufficient 
data is gathered, the predictor will provide accurate results that are currently only achievable by high cost 
biological experimentation. 



3.2 Major histocompatibility complex class II (MHC-II) 

3.2.1 Single-target predictions 

We performed a single-target prediction experiment using the dataset proposed by the authors of the RTA 
method [25] . The goal of such experiments was to evaluate the ability of a predictor to predict the binding 
energy (kcal/mol) of an unknown peptide to a specific MHC allotype when training only on peptides binding 
to this allotype. For each of the 16 MHC allotypes, a predictor was trained using kernel ridge regression with 
the GS kernel and a nested cross-validation scheme was used. For comparison purposes, the nested cross- 
validation was done using the 5 predefined cross-validation folds provided in [25] . Again, this is sub-optimal 
from the statistical machine learning perspective, since the known guarantees on the risk of a predictor [12,17] 
normally require that the examples be generated independently from the same distribution. 

Three common metrics were used to compare the methods: the Pearson correlation coefficient (PCC), 
the root mean squared error (RMSE), and the area under the ROC curve (AUC). The PCC and the RMSE 
results are presented in Table 3, AUC values can be found as supplementary material [see Appendix]. The 
PCC results show that our method significantly outperforms the RTA method on 13 out of 16 allotypes 
with a p-value of 0.0308. The inferior results for certain allotypes may be attributed to the small size of 
these datasets. In addition, the RMSE results show that our method clearly outperforms the RTA method 
on all 16 allotypes with a p-value of 0.0005. 
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RMSE (kcal/mol) 


# of examples 


KRR+GS 


RTA 


KRR+GS 


RTA 


DRB1*0101 


0.632 


0.530 


1.20 


1.43 


5648 


DRB1*0301 


0.538 


0.425 


1.16 


1.46 


837 


DRB1*0401 


0.430 


0.340 


1.44 


1.72 


1014 


DRB1*0404 


0.491 


0.487 


1.25 


1.38 


617 


DRB1*0405 


0.530 


0.442 


1.09 


1.35 


642 


DRB1*0701 


0.645 


0.484 


1.24 


1.62 


833 


DRB1*0802 


0.469 


0.412 


1.19 


1.34 


557 


DRB1*0901 


0.303 


0.369 


1.55 


1.68 


551 


DRB1*1101 


0.550 


0.450 


1.17 


1.45 


812 


DRB1*1302 


0.468 


0.464 


1.51 


1.64 


636 


DRB1*1501 


0.502 


0.438 


1.41 


1.53 


879 


DRB3*0101 


0.380 


0.425 


1.03 


1.13 


483 


DRB4*0101 


0.613 


0.522 


1.10 


1.33 


664 


DRB5*0101 


0.541 


0.434 


1.20 


1.57 


835 


H2*IA 6 


0.603 


0.556 


1.00 


1.15 


526 


H2*IA d 


0.325 


0.563 


1.44 


1.53 


306 


Average: 


0.501 


0.459 


1.25 


1.46 





Table 3: Comparison of HLA-DR prediction results on the dataset proposed by the authors of RTA. Best 
results for each metric are highlighted in bold. 

3.2.2 Pan-specific predictions 

To further evaluate the performance of our method and the potential of the GS kernel, pan-specific predictions 
were performed using the dataset proposed by the authors of NetMHCIIpan [26] . The authors proposed a 
new cross-validation scheme called the leave one allele out (LOAO) where all but one allele are used as 
training set and the remaining allele is used as testing set. This is a more challenging problem, as the 
predictor needs to determine the binding affinity of peptides for an allele which was absent in the training 
data. The binding specificity of an allele's interface is commonly characterized using a pseudo sequence 
extracted from the beta chain's sequence [9,11,26]. During our experiments the 21 amino acid pseudo 
sequences were found to be optimal. The 89 amino acid pseudo sequences yielded similar, but slightly 
suboptimal results. For all experiments, the GS kernel was used for the allele pseudo sequences and for the 
peptide sequences. All results were obtained with the same LOAO scheme presented in [26]. For each allele 
an inner LOAO cross-validation was done for the selection of hyper-parameters. 

To assess the performance of the proposed method, the PCC and the RMSE results are shown in Table 4, 
AUC values can be found in the supplementary material [see Appendix]. Since we performed LOAO cross- 
validation, the PCC, RMSE and AUC values were calculated on each test fold individually, thus yielding- 
results for each allele. 

The PCC results show that our method outperforms the MultiRTA [10] (p-value of 0.001) and the 
NetMHCIIpan-2.0 [11] (p-value of 0.0574) methods. Since the dataset contained values in log 50fc transformed 
IC50 (Inhibitory Concentration 50%), the calculation of the RMSE values required converting the predicted 
values to kcal/mol using the method proposed in [25]. 

The RMSE values are only shown for our method and the MultiRTA method, since such values were 
not provided by the authors of NetMHCIIpan-2.0. The RMSE results indicate that our method globally 
outperforms MultiRTA with a p-value of 0.0466. 
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0.337 


1.73 


1.68 
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DRB1*1501 
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0.598 


1.46 


1.57 
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DRB3*0101 


0.654 
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0.474 


1.52 


1.10 
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1.41 


1.61 
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0.722 
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1.60 
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Table 4: Comparison of pan-specific HLA-DR prediction results on the dataset proposed by the authors of 
NetMHCIIpan. Best results for each metric are highlighted in bold. 
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RBF 


RBF 
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0.8782 
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0.9044 
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0.7531 


0.7641 


Cationic 


0.7511 


0.7417 


0.7547 



Table 5: Correlation coefficient (PCC) on the QSAM benchmarks. Best results are highlighted in bold. 
3.3 Quantitative structure affinity model (QSAM) benchmark 

For all datasets, the extended z scale [16] was found to be the optimal amino acids descriptors during cross- 
validation. All the results in this section were thus obtained using the extended z scale for the RBF and 
GS kernels. All peptides within each data set arc of the same length, which is why the RBF kernel can be 
applied, as opposed to the PepX database or the two MHC-II benchmark datasets. Note the RBF kernel 
is a special case of the GS kernel. Hence, the results obtained from our method using the GS kernel were 
likely to be at least as good as those obtained with the RBF kernel. 

Table 5 present the results obtained when applying the method from [16] (SVR learning with the RBF 
kernel) and our method (KRR learning with the GS kernel) . Results with the RBF kernel and KRR are also 
presented to illustrate the gain in accuracy obtained from the more general GS kernel. 

We observed that kernel ridge regression (KRR) had a slight accuracy advantage over support vector 
regression (SVR). Moreover, SVR has one more hyperparametcr to tune than KRR: the e-insensitive pa- 
rameter. Consequently, KRR should be preferred over SVR for requiring a substantially shorter learning 
time. Also, we show in Table 5 that the GS kernel outperforms the RBF kernel on all three QSAM data sets 
(when limiting ourself to KRR). Considering these results, KRR with the GS kernel clearly outperforms the 
method of [16] on all data sets. 
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4 Conclusions 

We have proposed a new kernel designed for small bio-molecules (such as peptides) and pseudo-sequences 
of binding interfaces. The GS kernel is an elegant generalization of eight known kernels for local signals. 
Despite the richness of this new kernel, we have provided a simple and efficient dynamic programming 
algorithm for its exact computation and a linear time algorithm for its approximation. Combined with 
the kernel ridge regression learning algorithm and the binding pocket kernel, the proposed kernel yields 
biologically relevant results on the PcpX database. For the first time, a predictor capable of accurately 
predicting the binding affinity of any peptide to any protein was learned using this database. Our method 
significantly outperformed RTA on the single-target prediction of MHC-II binding peptides. Impressive 
state-of-the-art results were also obtained on the pan-specific MHC-II task, outperforming both MultiRTA 
and NetMHCIIpan-2.0. Moreover, the method was successfully tested on three well studied datasets for the 
quantitative structure affinity model. 

A predictor trained on the whole IEDB database or PDB database, as opposed to benchmark datasets, 
would be a substantial tool for the community. Unfortunately, learning a predictor on very large datasets 
(over 25000 examples) is still a major challenge with most machine learning methods, as the similarity (Gram) 
matrix becomes hard to fit into the memory of most computers. We propose to expand the presented method 
to very large datasets as future work. 
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5 Appendix 

5.1 The proof of theorem 1 

Theorem 1 Let E be an alphabet (say the alphabet of all the amino acids). For each I G {1,..,L}, let 
Ki :Y} x Ti l — > R be a symmetric positive semi- definite kernel. Let A : R — > R be any function that is the 
convolution of another function B : R — > R, i.e., for all z, z' € R, we have 

/+oo 
B(z-t)B'(z' -t) At. 
-oo 

Then, the kernel K defined, for any two strings on the alphabet E 7 as 

l |x|-i l*'H 

K(x,x') = E E E A^-j) Ki(^x i+1 ,..,x i+l ), {x' j+1 ,..,x' j+l fj 

i=l i=0 j=0 

is also symmetric positive semi- definite. 

The proof is based on this well known theorem. 
Theorem 2 [30] 

Let £ be a finite alphabet, a string kernel if : E* x E* — > R is positive semi-definite if and only if it satisfies 
the Mercer's condition, that is 

E ]T /(x)/(x')#(x,x') > 

xGS* x'SS* 

for any function f : S* — > R such that X)xes* / 2 ( x ) < +°° ■ 

Proof of Theorem 1: Clearly, the kernel K is symmetric. So, we only have to show that it satisfies 
the Mercer's Condition of Theorem 2. 

For compactness of the notation, the sub-string (xi + i,x i+2 , ■ . ■ , x i+ i) of a string x will be denoted X] i:i+ j] . 

First note that 

E e /(x)/( X ')^(x,xo = EE E E /(x)/(x')^(x,xo 

xes*x'eE* s=o S '=o xes s x / e ss' 

+co +oo L |x|-i |x'|-i 

= EEE E EE E^ i -^/w/( x ')^(x 1W ] ; xU) 

s=0 s'=0 xGE s x'eS 3 ' ' =1 i=0 J=° 

-|_QQ _[_QQ S~~l S / 

= EEE E EEE^-^wm^'W 

s=0s'^0xeE s x / e s s ' *=1 
L +00 s— I +oo s' — i 

= IlZEEEE^-j) E E /(x)/(xo^(x ]M+i] ,x^. +i] ) w 

1=1 s=0 i=0 s'=0 j=0 xe£ s x ' e £s' 
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Note that the last part of the last line of Equation (★) can be rewritten as follows: 



A (*-j)J2 E /( x )/( x 0^x]M+i]> x L w] ) 

= ^-i)E EE E /(x)/(x')^(y,y,) 

yGS'y'eSi' xeE<*|x ]i:i+!] =y x'GS d |xf i:i+i] =y' 

/+oo 
B(i-t)B(j-t)dt E E E E /(x)/(x')^(y,y,), (**) 

~°° yeS'y'eS' xe£ d |x ]i:I + !] =y x'£E<i|x ] '. : . +!] =y' 

where x e S d | X] i:i+; ] = y means that the strings x is any string of length d that contains the substring y 
starting at its position i + Now, combining (★) and (**), and putting 



+00 s—l 

9u(y) = EE 5 * 8 -*) E /(y)> 

s=0 i=0 xeE d |x ]i:I + !] =y 

for any string y, any real number t and any / € {1, 2, . . . , L}, we obtain 
£ £ /(x)/(x')tf(x,x') 

= E E ^(y.y')EEEE / fl(<-w-t)* 

( = 1 (y,y')e£'x£< s = 1 = s'=0 i=o J ^ 00 



(16) 



E /(*) 

xeS d |x ]i:I+!] =y 



E /(-') 



E 

1=1 



+00 



E ^(y,y') 

, (y,y')eS^xS i 



+00 s- 



EE^-*) E /(*) 

xeE d |x ]l:i+i] =y 



s=0 i=0 



EE B (i-') E /(*') 

s '=0j=0 x'eS<i|xf 3:j+i] =y' 



dt 



- E 



i=i 



+00 



E Ki{y,y')g u {y)g u {y) ] di 
v(y,y')es ; xs ! 



> 0. 



The last line is a consequence of the fact that for every I, X)( y y ')es ! xs ! Kl(y>y')9t,l(y)9t,l{y) > 0. Indeed, 
this immediately follows from Theorem 2, because all the K{s have been supposed positive semi-definite. 
Hence, again by Theorem 2, K itself is also positive semi-definite. □ 



20 



5.2 AUC results for experiments on MHC-II 

Calculating the area under the ROC curve (AUC) [22] for regression problems requires transforming the initial 
problem into a classification problem [26]. In the case of MHC-peptide binding affinities, the classification 
problem aims to distinguish between binders and non-binders. To achieve this, a threshold value allowing 
to distinguish binders from non-binders is required. 

5.2.1 Single-target experiment 

For this experiment, the predicted values were binding energies in kcal/mol. As proposed in [26], we set 
a binding affinity threshold of 500nM. Therefore, the threshold value (t) in nanomolar was converted to 
kcal/mol using the technique proposed in [25]: 

t = -0.586 x Zo.g(500 x 1(T 9 ) = 8.50207kcal/mol. 

To calculate the AUC, we converted all the binding energies in the dataset to binary classes based on 
this binding threshold. Then, for all examples, we predicted the binding energy value (e) and generated a 
confidence value that the given example was a binder. This confidence value is given by: c — e — t and then 
normalized using all other confidence values to be in the range [0, 1]. The latter were used to calculate the 
AUC for the experiment. 

AUC 



MHC (3 chain 


KRR+GS 


RTA 


# of examples 


DRB1*0101 


0.838 


0.749 


5648 


DRB1*0301 


0.781 


0.762 


837 


DRB1*0401 


0.753 


0.715 


1014 


DRB1*0404 


0.786 


0.792 


617 


DRB1*0405 


0.782 


0.757 


642 


DRB1*0701 


0.845 


0.790 


833 


DRB1*0802 


0.724 


0.747 


557 


DRB1*0901 


0.665 


0.711 


551 


DRB1*1101 


0.830 


0.753 


812 


DRB1*1302 


0.708 


0.765 


636 


DRB1*1501 


0.740 


0.736 


879 


DRB3*0101 


0.716 


0.825 


483 


DRB4*0101 


0.831 


0.799 


664 


DRB5*0101 


0.826 


0.732 


835 


H2*IA b 


0.860 


0.828 


526 


H2*IA d 


0.772 


0.814 


306 


Average: 


0.779 


0.767 





Table 6: Results of the comparison between AUC values for binding energy predictions obtained from Kernel 
Ridge Regression and the GS Kernel versus the RTA [25] method on the dataset proposed by the authors of 
this method. 



21 



5.2.2 Pan-specific experiment 

The AUC for this experiment was calculated using confidence values as explained above. A threshold of 500 
nM was used to discriminate binders from non-binders [26]. 

AUC 



MHC (3 chain 


KRR+GS 


MultiRTA 


NetMHCIIpan-2.0 


# of examples 


DRB1*0101 


0.807 


0.801 


0.794 


5166 


DRB1*0301 


0.775 


0.751 


0.792 


1020 


DRB1*0401 


0.802 


0.763 


0.802 


1024 


DRB1*0404 


0.862 


0.835 


0.869 


663 


DRB 1*0405 


0.827 


0.808 


0.823 


630 


DRB1*0701 


0.891 


0.817 


0.886 


853 


DRB1*0802 


0.840 


0.786 


0.869 


420 


DRB1*0901 


0.685 


0.674 


0.684 


530 


DRB1*1101 


0.900 


0.819 


0.875 


950 


DRB1*1302 


0.648 


0.698 


0.648 


498 


DRB1*1501 


0.773 


0.729 


0.769 


934 


DRB3*0101 


0.690 


0.813 


0.733 


549 


DRB4*0101 


0.759 


0.746 


0.762 


446 


DRB5*0101 


0.888 


0.788 


0.879 


924 


Average: 


0.796 


0.773 


0.800 





Table 7: Results of the comparison between AUC values for binding affinity predictions obtained from Kernel 
Ridge Regression and the GS Kernel versus the MultiRTA [10] and the NetMHCIIpan-2.0 [11] methods on 
the dataset proposed by the authors of NetMHCIIpan [26] . 
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